Overview

data summary

About our Data

Our data contains real estate records from portions of North Carolina's Wake, Chatham and Durham Counties. The raw dataset has about 270,000 observations and 63 variables covering real estate, including but not limited to houses, apartments and commercial property. After cleaning, the dataset contains only Wake County records: about 64,000 observations and 22 variables.

Our responsibilities are divided as follows:

  • Ariq:
    • Natural Cubic Splines
    • k-Nearest Neighbors Classification
    • Ridge Regression
  • Marcos:
    • Multiple Linear Regression
    • Naive Bayes Classification
    • Logistic Regression
  • Both:
    • Compile dashboard and check each other’s work

Table Summary

Table Summary

| Name | Description |
|:---|:---|
| Calculated.Acreage | Calculated area from property in acres |
| Land.Class | Land classification description |
| Land.Class.Code | Land classification code |
| Total.Structures | Total number of structures on the property |
| Total.Units | Total number of units on the property |
| Building.Value | Revenue Dept. assessed value for structures contained within the property |
| Land.Value | Revenue Dept. assessed value for land contained within the property |
| Land.Sale.Value | US dollar value for the land when last sold. |
| Land.Sale.Date | Date that land was last sold. |
| Total.Sale.Value | US dollar value for the land and building(s) when last sold. |
| Total.Sale.Date | Date that land and/or building(s) were last sold. |
| WC.ETJ | Corporate limits where property is located. |
| Billing.Class | Billing classifications for Revenue Department use. |
| APA.Ownership.Description | American Planning Association (APA) ownership description. |
| APA.Activity.Description | APA activity description. |
| APA.Function.Description | APA function description. |
| APA.Site.Description | APA site description. |
| Total.Building.Square.Footage | Total square footage of the structure. |
| Type.And.Use.Description | Building use and type description |
| Phy.City | City where property is located. |
| Shape.STArea | Property structure area. |
| Year.Built | Year property was built. |

Source: Town of Cary Dataset Schema

Multiple Linear Regression

Column

Research & Model Selection

My goal for this section is to model what maximizes the total sale value of a plot of land from this dataset. Initially I wanted to include both numeric and categorical data in my model, but I ran into some problems – more on that later. Using the cleaned dataset and all possible subsets, I picked out seven regressors and compared their selection criteria: Calculated.Acreage, Total.Structures, Total.Units, Building.Value, Land.Value, Land.Sale.Value, and Total.Building.Square.Footage. The following shows these results.

Acreage Structures Units Building Land Land Sale Sq. Footage AdjR2 Cp BIC
* 0.7452296 9596.42695 -93695.7
* * 0.7725733 1212.01377 -101466.9
* * * 0.7740828 750.09264 -101913.2
* * * * 0.7756057 284.09159 -102366.6
* * * * * 0.7761798 109.05205 -102532.0
* * * * * * 0.7764352 31.72609 -102600.2
* * * * * * * 0.7765158 8.00000 -102614.8

As shown above, every criterion selects the full model as the most accurate. This seems like a great fit to the data – however, it's not. The model's variance inflation factors (VIF) and the data's scatterplot matrix, shown to the right, prove that there is collinearity between the regressors.

           Calculated.Acreage              Total.Structures 
                     2.419908                      3.413634 
                  Total.Units                Building.Value 
                     2.232743                     11.698253 
                   Land.Value               Land.Sale.Value 
                     9.004143                      2.016888 
Total.Building.Square.Footage 
                    16.285497 

This calls for a reduced model.

Reduced Model & Diagnostics

Our new best model regresses Total.Sale.Value on Calculated.Acreage, Total.Structures, Total.Units, and Land.Sale.Value. The model summary and VIF are shown below.


Call:
lm(formula = Total.Sale.Value ~ Calculated.Acreage + Total.Structures + 
    Total.Units + Land.Sale.Value, data = df_mlr)

Residuals:
      Min        1Q    Median        3Q       Max 
-65277755   -119735    -20383     85065  95479184 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)        -1.405e+06  1.511e+04  -93.01   <2e-16 ***
Calculated.Acreage  3.610e+05  6.232e+03   57.93   <2e-16 ***
Total.Structures    1.591e+06  1.532e+04  103.88   <2e-16 ***
Total.Units         3.762e+04  4.702e+02   79.99   <2e-16 ***
Land.Sale.Value     8.212e-01  1.722e-02   47.69   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1379000 on 68532 degrees of freedom
Multiple R-squared:  0.6066,    Adjusted R-squared:  0.6066 
F-statistic: 2.642e+04 on 4 and 68532 DF,  p-value: < 2.2e-16
Calculated.Acreage   Total.Structures        Total.Units    Land.Sale.Value 
          1.931941           2.333885           1.820155           1.285553 

The calculated acreage, the land sale value, and the total numbers of structures and units are each positively associated with the total sale value of a property. This model explains about 60.66% of the variance, as seen from the adjusted-\(R^2\) value.

However, there's a problem: none of the assumptions for multiple linear regression is satisfied, and a log transform is necessary. The one drawback is that Total.Units can no longer be used, since a large portion of its values are zero and log(0) is undefined.

Final Model

The final model is as follows:


Call:
lm(formula = log(Total.Sale.Value) ~ log(Calculated.Acreage) + 
    log(Total.Structures) + log(Land.Sale.Value), data = df_mlr)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.0474 -0.2003 -0.0033  0.2204  4.4988 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)             6.859083   0.030204  227.09   <2e-16 ***
log(Calculated.Acreage) 0.101919   0.002066   49.33   <2e-16 ***
log(Total.Structures)   0.713479   0.018643   38.27   <2e-16 ***
log(Land.Sale.Value)    0.554229   0.002730  203.01   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4743 on 68533 degrees of freedom
Multiple R-squared:  0.4493,    Adjusted R-squared:  0.4493 
F-statistic: 1.864e+04 on 3 and 68533 DF,  p-value: < 2.2e-16

Looking at the graphs to the right (aside from collinearity), you can see side-by-side comparisons showing how the log transform satisfies every assumption but normality. With this in mind, the model might not produce the most accurate or precise results; nonparametric methods should be considered for this question.

Column

Collinearity

Variance

Normality

Leverage

Ridge Regression

Column

Creating the Initial Model

The first step is to create a ridge regression model using the glmnet library without fixing a \(\lambda\) yet; glmnet fits the model over an entire grid of \(\lambda\) values. Below is a summary of said model. In the plot tabs, you can see the coefficient paths for this model.

Length Class Mode
a0 100 -none- numeric
beta 300 dgCMatrix S4
df 100 -none- numeric
dim 2 -none- numeric
lambda 100 -none- numeric
dev.ratio 100 -none- numeric
nulldev 1 -none- numeric
npasses 1 -none- numeric
jerr 1 -none- numeric
offset 1 -none- logical
call 4 -none- call
nobs 1 -none- numeric

Finding the Best \(\lambda\)

To find our best \(\lambda\) value, we cross-validate our models. The glmnet library has a built-in function for this (cv.glmnet), and the process is visualized under “Plotting Cross-Validated Model”. To see whether this improved the previous model, we calculate the adjusted \(R^2\) to measure how much variation the model accounts for. The coefficient plot with our best \(\lambda\) is in the “Plotting Best Model” tab.
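The fit-then-cross-validate workflow described above can be sketched as follows. This is a minimal illustration on made-up toy data, not the actual Wake County sample; the column names only mirror the ones used in the text.

```r
library(glmnet)

# Toy data standing in for the real predictors (hypothetical values)
set.seed(1)
x <- cbind(Calculated.Acreage = runif(200, 0.1, 5),
           Total.Structures   = sample(1:3, 200, replace = TRUE),
           Land.Sale.Value    = runif(200, 1e4, 5e5))
y <- 3e5 * x[, 1] + 1e6 * x[, 2] + 0.8 * x[, 3] + rnorm(200, sd = 1e5)

ridge_fit <- glmnet(x, y, alpha = 0)     # alpha = 0 selects the ridge penalty
cv_fit    <- cv.glmnet(x, y, alpha = 0)  # 10-fold CV over the lambda path
best_lambda <- cv_fit$lambda.min         # lambda minimizing CV error
coef(cv_fit, s = "lambda.min")           # coefficients at the best lambda
```

`plot(ridge_fit)` and `plot(cv_fit)` produce the coefficient-path and cross-validation plots shown in the tabs.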

Below is the \(\lambda\) value that best fits our data.

[1] 1462622

Below are the coefficients for the model with said \(\lambda\).

4 x 1 sparse Matrix of class "dgCMatrix"
                              s0
(Intercept)        -1.125869e+06
Calculated.Acreage  3.956674e+05
Total.Structures    1.364952e+06
Land.Sale.Value     7.783176e-01

Below is the summary of said model.

Length Class Mode
a0 1 -none- numeric
beta 3 dgCMatrix S4
df 1 -none- numeric
dim 2 -none- numeric
lambda 1 -none- numeric
dev.ratio 1 -none- numeric
nulldev 1 -none- numeric
npasses 1 -none- numeric
jerr 1 -none- numeric
offset 1 -none- logical
call 5 -none- call
nobs 1 -none- numeric

The whole point of using this method was to see whether we could find an improved model that addresses multicollinearity. Judging by the adjusted \(R^2\) below, this model is a slight improvement over the previous selection while using the same variables and response.

[1] 0.5225592
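Since glmnet does not report an adjusted \(R^2\) directly, one way to obtain it is from the model's predictions. A small helper, assuming `y` holds the observed values, `preds` the fitted values, and `p` the number of predictors:

```r
# Adjusted R^2 from observed values, fitted values, and predictor count p
adj_r2 <- function(y, preds, p) {
  n   <- length(y)
  sse <- sum((y - preds)^2)    # residual sum of squares
  sst <- sum((y - mean(y))^2)  # total sum of squares
  1 - (sse / (n - p - 1)) / (sst / (n - 1))
}

adj_r2(c(1, 2, 3, 4), c(1, 2, 3, 4), p = 1)  # perfect fit: returns 1
```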

Column

First Model Coefficients Plot

Plotting Cross-Validated Model

Plotting Best Model

Logistic Regression

Column

All Possible Subsets

What makes a piece of property expensive? I wanted to answer this question using logistic regression as my classification model. To do so, I created a new Boolean variable, Is.Expensive, which is true if the total sale value of the property is over $489,000, the 75th percentile of our dataset. Before picking out the best model, I filtered out all property where the total sale value was over $832,500, since anything over that threshold is considered an outlier. Using this new variable, I ran all possible subsets to find the best variables for my model. The table showing all possible subsets is shown below:

Acreage Structures Units Building Sq. Footage Expensive? AdjR2 Cp BIC
* 0.3319078 98.975288 -26014.95
* * 0.3323767 54.599215 -26050.20
* * * 0.3326722 27.008328 -26068.70
* * * * 0.3328098 14.691170 -26071.94
* * * * * 0.3329046 6.519005 -26071.04
* * * * * * 0.3329100 7.000000 -26061.48
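The construction of the Is.Expensive flag described above can be sketched as follows, with a toy data frame standing in for the cleaned dataset (the two thresholds are the ones from the text):

```r
library(dplyr)

# Toy stand-in for the cleaned data (hypothetical sale values)
toy <- data.frame(Total.Sale.Value = c(150000, 500000, 900000, 480000))

df_log2_toy <- toy %>%
  filter(Total.Sale.Value <= 832500) %>%            # drop outliers above the fence
  mutate(Is.Expensive = Total.Sale.Value > 489000)  # top-quartile threshold
```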

As shown above, while the other criteria point to multiple models having the best fit, Mallows' \(C_p\) points to the five-regressor model. I proceeded with the model classifying expensiveness using calculated acreage, total structures, building value and square footage.

The following shows the summary and VIF of the chosen model.


Call:
glm(formula = Is.Expensive ~ Calculated.Acreage + Total.Structures + 
    Building.Value + Total.Building.Square.Footage, family = "binomial", 
    data = df_log2)

Coefficients:
                                Estimate Std. Error z value Pr(>|z|)    
(Intercept)                   -5.504e+00  6.890e-01  -7.989 1.36e-15 ***
Calculated.Acreage            -1.830e-01  2.913e-02  -6.283 3.33e-10 ***
Total.Structures              -1.453e+00  6.879e-01  -2.113   0.0346 *  
Building.Value                 1.459e-05  2.348e-07  62.157  < 2e-16 ***
Total.Building.Square.Footage  1.840e-04  2.952e-05   6.233 4.57e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 66676  on 64552  degrees of freedom
Residual deviance: 43456  on 64548  degrees of freedom
AIC: 43466

Number of Fisher Scoring iterations: 5
           Calculated.Acreage              Total.Structures 
                     1.048652                      1.002820 
               Building.Value Total.Building.Square.Footage 
                     2.895133                      2.875376 

Looking at the scatterplot matrix and the VIF of this model, there is strong evidence that building value and square footage are collinear. With that in mind, I refit a smaller model without the building value of the property.


Call:
glm(formula = Is.Expensive ~ Calculated.Acreage + Total.Structures + 
    Total.Building.Square.Footage, family = "binomial", data = df_log2)

Coefficients:
                                Estimate Std. Error z value Pr(>|z|)    
(Intercept)                   -5.131e+00  6.493e-01  -7.902 2.74e-15 ***
Calculated.Acreage            -2.808e-02  2.707e-02  -1.037   0.2997    
Total.Structures              -1.559e+00  6.484e-01  -2.404   0.0162 *  
Total.Building.Square.Footage  1.829e-03  1.741e-05 105.045  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 66676  on 64552  degrees of freedom
Residual deviance: 48088  on 64549  degrees of freedom
AIC: 48096

Number of Fisher Scoring iterations: 5
           Calculated.Acreage              Total.Structures 
                     1.036145                      1.005022 
Total.Building.Square.Footage 
                     1.032090 

Now every VIF in the model is close to one, but the calculated acreage is no longer significant. With this, we can show our final model on the next page.

Final Model

The final model (and the model the logit graph is based on) is shown below.


Call:
glm(formula = Is.Expensive ~ Total.Structures + Total.Building.Square.Footage, 
    family = "binomial", data = df_log2)

Coefficients:
                                Estimate Std. Error z value Pr(>|z|)    
(Intercept)                   -5.089e+00  6.523e-01  -7.802 6.09e-15 ***
Total.Structures              -1.603e+00  6.513e-01  -2.461   0.0139 *  
Total.Building.Square.Footage  1.826e-03  1.714e-05 106.497  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 66676  on 64552  degrees of freedom
Residual deviance: 48089  on 64550  degrees of freedom
AIC: 48095

Number of Fisher Scoring iterations: 5
             Total.Structures Total.Building.Square.Footage 
                     1.000806                      1.000806 

The only remaining predictors are the number of structures and the total building square footage. With these we can model our logit function, and after running cross-validation we get only about a 12% misclassification rate, shown below.

[1] 0.1160497
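The cross-validated misclassification rate can be estimated with `boot::cv.glm` and a 0/1-loss cost function. A sketch on toy binary data (the real model and data differ):

```r
library(boot)

# Toy binary-response data (hypothetical)
set.seed(1)
toy <- data.frame(x = rnorm(200))
toy$y <- rbinom(200, 1, plogis(2 * toy$x))

fit  <- glm(y ~ x, family = "binomial", data = toy)
cost <- function(obs, pred) mean(abs(obs - pred) > 0.5)  # 0/1 misclassification loss
cv_err <- cv.glm(toy, fit, cost = cost, K = 10)$delta[1]
cv_err  # estimated misclassification rate
```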

Column

Scatterplot Matrix

Logit Graph

Natural Cubic Splines

Column

Different Degrees

First, I filtered the data so it only includes land and land sale values between $250,000 and $900,000, so that outliers aren't included in our regression model. I also sampled the data down to 3,000 observations for legibility purposes. Then I fit models with one to five degrees of freedom in order to find our best model. Below is a summary of the five models.


Call:
lm(formula = Total.Sale.Value ~ ns(Land.Value, df = 1), data = ns_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-648746  -93626  -11453   78766  640988 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)              179847       6280   28.64   <2e-16 ***
ns(Land.Value, df = 1)   760240      22217   34.22   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 143400 on 2998 degrees of freedom
Multiple R-squared:  0.2809,    Adjusted R-squared:  0.2806 
F-statistic:  1171 on 1 and 2998 DF,  p-value: < 2.2e-16

Call:
lm(formula = Total.Sale.Value ~ ns(Land.Value, df = 2), data = ns_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-604649  -92583  -11297   77442  666189 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)               135267      11669   11.59   <2e-16 ***
ns(Land.Value, df = 2)1   699665      24071   29.07   <2e-16 ***
ns(Land.Value, df = 2)2   429094      23937   17.93   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 143000 on 2997 degrees of freedom
Multiple R-squared:  0.2858,    Adjusted R-squared:  0.2853 
F-statistic: 599.5 on 2 and 2997 DF,  p-value: < 2.2e-16

Call:
lm(formula = Total.Sale.Value ~ ns(Land.Value, df = 3), data = ns_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-605853  -92744  -11497   77232  668890 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)               128687      22702   5.669 1.58e-08 ***
ns(Land.Value, df = 3)1   327241      13933  23.486  < 2e-16 ***
ns(Land.Value, df = 3)2   648590      50789  12.770  < 2e-16 ***
ns(Land.Value, df = 3)3   485702      31597  15.372  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 143000 on 2996 degrees of freedom
Multiple R-squared:  0.2858,    Adjusted R-squared:  0.2851 
F-statistic: 399.7 on 3 and 2996 DF,  p-value: < 2.2e-16

Call:
lm(formula = Total.Sale.Value ~ ns(Land.Value, df = 4), data = ns_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-596912  -93042  -10116   77587  688019 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)                56195      35167   1.598     0.11    
ns(Land.Value, df = 4)1   294880      33244   8.870   <2e-16 ***
ns(Land.Value, df = 4)2   416004      25431  16.358   <2e-16 ***
ns(Land.Value, df = 4)3   782249      75395  10.375   <2e-16 ***
ns(Land.Value, df = 4)4   480473      34039  14.115   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 142800 on 2995 degrees of freedom
Multiple R-squared:  0.288, Adjusted R-squared:  0.2871 
F-statistic: 302.9 on 4 and 2995 DF,  p-value: < 2.2e-16

Call:
lm(formula = Total.Sale.Value ~ ns(Land.Value, df = 5), data = ns_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-586565  -90033  -10738   78083  667982 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)               165509      40884   4.048 5.29e-05 ***
ns(Land.Value, df = 5)1   222769      38195   5.832 6.04e-09 ***
ns(Land.Value, df = 5)2   179367      43262   4.146 3.48e-05 ***
ns(Land.Value, df = 5)3   464473      26480  17.541  < 2e-16 ***
ns(Land.Value, df = 5)4   501294      89107   5.626 2.02e-08 ***
ns(Land.Value, df = 5)5   375920      38653   9.725  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 141900 on 2994 degrees of freedom
Multiple R-squared:  0.297, Adjusted R-squared:  0.2958 
F-statistic:   253 on 5 and 2994 DF,  p-value: < 2.2e-16

To determine the best degree of freedom, I calculated each spline's sum of squared errors (SSE), the sum of the squared residuals.
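The fit-and-score loop described above can be sketched on toy data (hypothetical values standing in for the filtered sample):

```r
library(splines)

# Toy stand-in for ns_df (hypothetical values)
set.seed(1)
ns_toy <- data.frame(Land.Value = runif(500, 250000, 900000))
ns_toy$Total.Sale.Value <- 180000 + 0.8 * ns_toy$Land.Value + rnorm(500, sd = 50000)

# Fit natural cubic splines with 1 to 5 degrees of freedom
fits <- lapply(1:5, function(d)
  lm(Total.Sale.Value ~ ns(Land.Value, df = d), data = ns_toy))

# SSE of each model: the sum of the squared residuals
sse <- sapply(fits, function(m) sum(resid(m)^2))
```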

df and SSE for NS models

| df | SSE          |
|:--:|:-------------|
| 1  | 6.167360e+13 |
| 2  | 6.125475e+13 |
| 3  | 6.124978e+13 |
| 4  | 6.106117e+13 |
| 5  | 6.028909e+13 |

Best Model

From the SSE table and the elbow plot, the best model appears to be the one with five degrees of freedom. Below is a summary of the model.


Call:
lm(formula = Total.Sale.Value ~ ns(Land.Value, df = 5), data = ns_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-586565  -90033  -10738   78083  667982 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)               165509      40884   4.048 5.29e-05 ***
ns(Land.Value, df = 5)1   222769      38195   5.832 6.04e-09 ***
ns(Land.Value, df = 5)2   179367      43262   4.146 3.48e-05 ***
ns(Land.Value, df = 5)3   464473      26480  17.541  < 2e-16 ***
ns(Land.Value, df = 5)4   501294      89107   5.626 2.02e-08 ***
ns(Land.Value, df = 5)5   375920      38653   9.725  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 141900 on 2994 degrees of freedom
Multiple R-squared:  0.297, Adjusted R-squared:  0.2958 
F-statistic:   253 on 5 and 2994 DF,  p-value: < 2.2e-16

Column

Plots for Models with Different Degrees

Elbow Plot

Best Model Plotted

Naive Bayes

Column

Accuracy rate

77%

Observations tested

20261

Training/testing split

70-30%

Confusion Matrix

Column

Research Question & Classification Process

Let’s pretend that you’re a property developer building single-family houses and you bought a piece of land and want to estimate how much money your residents are going to pay per month in rent. Over the span of three years, about how much will they need to pay per month for your net income to break even?

I wanted to view the relationship between the land value, acreage and total sale value of a plot of residential land. Since Naive Bayes is much better suited to categorical data than to numeric data, I created intervals for these variables. I calculate the estimated monthly rent over the 36-month span as follows:

\[\hbox{Rent} = \frac{1}{36} \, \left(\hbox{Total Sale Value - Land Value}\right)\]

For example, a property with a total sale value of $400,000 on land valued at $100,000 gives an estimated rent of $300,000 / 36, or about $8,333 per month.

Below is a table summarizing these intervals:

| Interval           | Description                                              |
|:-------------------|:---------------------------------------------------------|
| Calculated Acreage | “Small” if smaller than 0.4 acres, “Large” if greater    |
| Land Value         | Less than $50k, between $50k and $100k, or over $100k    |
| Total Sale Value   | Less than $200k, between $200k and $400k, or over $400k  |
| Estimated Rent     | Less than $5k, between $5k and $10k, or over $10k        |
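With the variables discretized, fitting the classifier follows the usual `e1071::naiveBayes` pattern. A sketch on toy categorical data, reusing the interval labels above (the made-up levels and random labels are not the real sample):

```r
library(e1071)

# Toy discretized data using the interval labels from the text
set.seed(1)
n <- 300
toy <- data.frame(
  Acreage    = factor(sample(c("Small", "Large"), n, replace = TRUE)),
  Land.Value = factor(sample(c("<50k", "50k-100k", ">100k"), n, replace = TRUE)),
  Rent       = factor(sample(c("<5k", "5k-10k", ">10k"), n, replace = TRUE))
)

nb_fit <- naiveBayes(Rent ~ Acreage + Land.Value, data = toy)
preds  <- predict(nb_fit, toy)  # class predictions for the confusion matrix
```

Feeding `preds` and the true labels to `caret::confusionMatrix` produces output like the matrix below.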

With this in mind, the output of the confusion matrix is shown below:

Confusion Matrix and Statistics

         Actual
Predicted  <5k >10k 5k-10k
   <5k    2783   12      9
   >10k     25 5845   1955
   5k-10k 2524   71   7037

Overall Statistics
                                          
               Accuracy : 0.7732          
                 95% CI : (0.7673, 0.7789)
    No Information Rate : 0.4443          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.6452          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       

Statistics by Class:

                     Class: <5k Class: >10k Class: 5k-10k
Sensitivity              0.5219      0.9860        0.7818
Specificity              0.9986      0.8619        0.7695
Pos Pred Value           0.9925      0.7470        0.7306
Neg Pred Value           0.8540      0.9933        0.8152
Prevalence               0.2632      0.2926        0.4443
Detection Rate           0.1374      0.2885        0.3473
Detection Prevalence     0.1384      0.3862        0.4754
Balanced Accuracy        0.7603      0.9239        0.7757

kNN Classification

Column

The Method

I created a new column, Time.Period, a categorical variable with three values: Old for properties built before 1995, Modern for properties built between 1995 and 2019, and New for properties built after 2019. For visualization purposes, I reduced the dataset to a sample of 5,000 observations. I then split the data 70/30 into training and testing sets, ran the kNN classifier (\(k=5\)) on the testing data, constructed the confusion matrix, and got a classification rate of about 76%.
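The split-and-classify procedure might look like the following sketch, using toy numeric features in place of the real property variables:

```r
library(FNN)

# Toy features and Time.Period-style labels (hypothetical values)
set.seed(1)
n <- 500
feats  <- data.frame(sqft = runif(n, 800, 5000), acres = runif(n, 0.1, 2))
labels <- factor(sample(c("Old", "Modern", "New"), n, replace = TRUE,
                        prob = c(0.2, 0.7, 0.1)))

feats_s <- scale(feats)            # standardize before distance computation
idx <- sample(n, 0.7 * n)          # 70/30 train/test split
pred <- FNN::knn(train = feats_s[idx, ], test = feats_s[-idx, ],
                 cl = labels[idx], k = 5)
rate <- mean(pred == labels[-idx]) # classification rate
```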

A problem with this method is that properties are not evenly distributed across the years: older and newer properties are underrepresented, so there is a heavy bias towards classifying a property as Modern, since most properties in the sample were built between 1995 and 2019.

Confusion Matrix and Statistics

                   Reference
Prediction          Modern(1995-2019) New(2019-) Old(Before 1995)
  Modern(1995-2019)              1076         89              135
  New(2019-)                       34         13                7
  Old(Before 1995)                 79          0               67

Overall Statistics
                                          
               Accuracy : 0.7707          
                 95% CI : (0.7485, 0.7917)
    No Information Rate : 0.7927          
    P-Value [Acc > NIR] : 0.9827          
                                          
                  Kappa : 0.2279          
                                          
 Mcnemar's Test P-Value : 5.024e-10       

Statistics by Class:

                     Class: Modern(1995-2019) Class: New(2019-)
Sensitivity                            0.9050          0.127451
Specificity                            0.2797          0.970672
Pos Pred Value                         0.8277          0.240741
Neg Pred Value                         0.4350          0.938451
Prevalence                             0.7927          0.068000
Detection Rate                         0.7173          0.008667
Detection Prevalence                   0.8667          0.036000
Balanced Accuracy                      0.5924          0.549062
                     Class: Old(Before 1995)
Sensitivity                          0.32057
Specificity                          0.93881
Pos Pred Value                       0.45890
Neg Pred Value                       0.89513
Prevalence                           0.13933
Detection Rate                       0.04467
Detection Prevalence                 0.09733
Balanced Accuracy                    0.62969

Accuracy rate (%)

76

Observations tested

5000

Training/testing split (%)

70-30

Column

kNN Visualization Plot

Confusion Matrix

---
title: "Property Sale Value Analysis"
author: "Marcos Fassio Bazzi, Ariq Rashid"
output: 
  flexdashboard::flex_dashboard:
    orientation: columns
    vertical_layout: fill
    source_code: embed
    theme: yeti
---

```{r setup, include=FALSE}
library(flexdashboard)
library(knitr)
library(olsrr)
library(leaps)
library(faraway)
library(GGally)
library(tidyverse)
library(glmnet)
library(ISLR2)
library(splines)
library(caret)
library(FNN)
library(e1071)
library(gmodels)
library(car)
library(arm)
library(boot)
```

```{r}
##CHANGE THIS WORKING DIRECTORY--------------------
setwd("C:/Users/marco/Desktop/CMDA 4654/Projects/Project 1")
```

```{r}
df <- read.csv("data/property.csv", header=TRUE)
wake_cols <- c(10, 18:21, 24:29, 34, 43:46, 48, 50, 52, 53, 59, 61) # columns i want in cleaned data
useless_APA_cols <- c("Site In Natural State", "Developing Site") # rows i DONT want in cleaned data
useless_land_class_cols <- c("Part Exempt", "State Assessed", "Vacant Land", "Manufactured Home", 
                       "Manufactured Home Park")
useless_ownership_cols <- c("County, Parish, Province, Etc.", "Federal Government")

df_w <- df %>%
    dplyr::select(all_of(wake_cols)) %>%
    mutate_if(is.character, list(~na_if(.,""))) %>%
    filter(Land.Value != 0 & Building.Value != 0 & Land.Sale.Value > 0 & Total.Sale.Value > 0) %>%
    filter(!APA.Site.Description %in% useless_APA_cols) %>% 
    filter(!Land.Class %in% useless_land_class_cols) %>% 
    filter(!APA.Ownership.Description %in% useless_ownership_cols) %>%
    # only one instance found - essentially an exception
    drop_na()
```

Overview
=======================================================================

data summary {data-width=350}
-----------------------------------------------------------------------

### About our Data

Our data contains real estate records from portions of North Carolina's Wake, Chatham and Durham Counties. The raw dataset has about 270,000 observations and 63 variables covering real estate, including but not limited to houses, apartments and commercial property. After cleaning, the dataset contains only Wake County records: about 64,000 observations and 22 variables. 

* Source: [data.gov](https://catalog.data.gov/dataset/real-estate-data) 
* Landing page: [townofcary.org](https://data.townofcary.org/explore/dataset/property/information) 
* Raw Data: [Google Drive](https://drive.google.com/file/d/15Gm2mkMeiIoca7K-YL8wG-0lwqsXrdJG/view?usp=sharing).

Our responsibilities are divided as follows:

* Ariq:
    - Natural Cubic Splines
    - k-Nearest Neighbors Classification
    - Ridge Regression
* Marcos:
    - Multiple Linear Regression
    - Naive Bayes Classification
    - Logistic Regression
* Both:
    - Compile dashboard and check each other's work

Table Summary {data-width=650}
-----------------------------------------------------------------------

### Table Summary

|            **Name**           |                              **Description**                              |
|:-----------------------------:|:-------------------------------------------------------------------------:|
|       Calculated.Acreage      |                   Calculated area from property in acres                  |
|           Land.Class          |                      Land classification description                      |
|        Land.Class.Code        |                          Land classification code                         |
|        Total.Structures       |                 Total number of structures on the property                |
|          Total.Units          |                   Total number of units on the property                   |
|         Building.Value        | Revenue Dept. assessed value for structures contained within the property |
|           Land.Value          |    Revenue Dept. assessed value for land contained within the property    |
|        Land.Sale.Value        |                US dollar value for the land when last sold.               |
|         Land.Sale.Date        |                       Date that land was last sold.                       |
|        Total.Sale.Value       |        US dollar value for the land and building(s) when last sold.       |
|        Total.Sale.Date        |             Date that land and/or building(s) were last sold.             |
|             WC.ETJ            |                Corporate limits where property is located.                |
|         Billing.Class         |            Billing classifications for Revenue Department use.            |
|   APA.Ownership.Description   |         American Planning Association (APA) ownership description.        |
|    APA.Activity.Description   |                         APA activity description.                         |
|    APA.Function.Description   |                         APA function description.                         |
|      APA.Site.Description     |                           APA site description.                           |
| Total.Building.Square.Footage |                   Total square footage of the structure.                  |
|    Type.And.Use.Description   |                     Building use and type description                     |
|            Phy.City           |                      City where property is located.                      |
|          Shape.STArea         |                          Property structure area.                         |
|           Year.Built          |                          Year property was built.                         |

**Source: Town of Cary Dataset Schema**

Multiple Linear Regression
=======================================================================

summary {.sidebar}
-----------------------------------------------------------------------

### Summary

What maximizes the total sale value of a piece of property? After comparing all possible subsets and their criteria and accounting for collinearity, the model regressing the total sale value on the calculated acreage, total structures, total units and land sale value is our most accurate model.

However, that model violated all of the regression assumptions. Applying a log transform satisfies every assumption but normality. Note that this model might not produce the most accurate or precise results; nonparametric methods and models should be considered.

Side-by-side comparisons are provided in the "Variance", "Normality" and "Leverage" tabs.

Column {data-width=500, .tabset}
-----------------------------------------------------------------------

### Research & Model Selection

```{r}
mlr_cols <- c(1, 4:8, 10, 18)
df_mlr <- df_w %>%
    dplyr::select(all_of(mlr_cols))
```

My goal for this section is to model what maximizes the total sale value of a plot of land from this dataset. Initially I wanted to include both numeric and categorical data in my model, but I ran into some problems -- more on that later. Using the cleaned dataset and all possible subsets, I picked out seven regressors and compared their selection criteria: `Calculated.Acreage`, `Total.Structures`, `Total.Units`, `Building.Value`, `Land.Value`, `Land.Sale.Value`, and `Total.Building.Square.Footage`. The following shows these results.

```{r}
best_subsets <- regsubsets(Total.Sale.Value ~ ., data=df_mlr)
best_subsets_results <- summary(best_subsets)

table <- data.frame(best_subsets_results$outmat, best_subsets_results$adjr2,
                    best_subsets_results$cp, best_subsets_results$bic)
colnames(table) <- c("Acreage", "Structures", "Units", "Building", 
                    "Land", "Land Sale", "Sq. Footage", "AdjR2", "Cp", "BIC")
rownames(table) <- 1:7
knitr::kable(table)
```

As shown above, every criterion indicates that the full model is the most accurate. This *seems* like a great fit to the data -- however, it's not. The model's variance inflation factors (VIF) and the data's scatterplot matrix, shown to the right, show that there is collinearity between the regressors.
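Recall that the VIF for regressor $j$ comes from regressing that variable on all the other regressors:

$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2}$$

Values well above 10 are commonly taken as a sign of serious collinearity.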

```{r}
full_model <- lm(Total.Sale.Value ~ ., data=df_mlr)
vif(full_model)
```

This calls for a reduced model.

### Reduced Model & Diagnostics

Our new best model regresses `Total.Sale.Value` on `Calculated.Acreage`, `Total.Structures`, `Total.Units`, and `Land.Sale.Value`. The model summary and VIF are shown below.

```{r}
reduced_model <- lm(Total.Sale.Value ~ Calculated.Acreage + Total.Structures + Total.Units + Land.Sale.Value, 
                 data=df_mlr)
summary(reduced_model)
vif(reduced_model)
```

The calculated acreage, the land sale value, and the total number of structures and units are all positively associated with the total sale value of a property. This model explains about 60.66% of the variance, as seen in the adjusted-$R^2$ value.

However, there's a problem: none of the assumptions for multiple linear regression are satisfied, and a log transform is necessary. The one drawback is that `Total.Units` can no longer be used, since a large portion of the observations are zero and $\log(0)$ is undefined.
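As a quick standalone illustration of the issue (illustrative values, not part of the fitted model):

```{r}
# log() is undefined at zero, so observations with zero units break the transform
units <- c(0, 1, 2, 5)  # hypothetical unit counts
log(units)              # first element is -Inf, which lm() cannot use
log1p(units)            # log(1 + x) stays finite at zero; one common workaround
```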

### Final Model

The final model is as follows:

```{r}
final_model <- lm(log(Total.Sale.Value) ~ log(Calculated.Acreage) + log(Total.Structures) + 
                      log(Land.Sale.Value), data=df_mlr)
summary(final_model)
```

Looking at the graphs to the right (aside from the collinearity tab), you can see a side-by-side comparison showing how the log transform satisfies all assumptions except normality. With this in mind, this model might not produce the most accurate or precise results, and nonparametric methods should be considered for this question.

Column {data-width=500 .tabset}
-----------------------------------------------------------------------

### Collinearity

```{r}
temp_df <- df_mlr
colnames(temp_df) <- c("CA", "TS", "TU", "BV", "LV", "LSV", "TSV", "TBSF")
ggpairs(temp_df[1:10000, ]) + # i will NOT let R run 70000 data points 49 times.
    theme(axis.line=element_blank(), axis.text=element_blank(), axis.ticks=element_blank())
```

### Variance

```{r}
par(mfrow=c(2,1))
plot(reduced_model, which=1, pch=20, main="Full Model")
plot(final_model, which=1, pch=20, main="Log Model")
```

### Normality

```{r}
par(mfrow=c(2,1))
plot(reduced_model, which=2, pch=20, main="Full Model")
plot(final_model, which=2, pch=20, main="Log Model")
```

### Leverage

```{r}
par(mfrow=c(2,1))
plot(reduced_model, which=5, pch=20, main="Full Model")
plot(final_model, which=5, pch=20, main="Log Model")
```

Ridge Regression
=================================================

summary {.sidebar}
---------------------------------------------------

### Applying Ridge Regression

The point of this method is to address multicollinearity in a model. In this dataset, many of the potentially useful columns are correlated with each other. When building our MLR model we had to apply a log transform; I want to use ridge regression to see if we can fit the model without any transformations.

Column
-----------------------------------------------------

### Creating the Initial Model

The first step is to create a ridge regression model using the `glmnet` library, without specifying a $\lambda$ yet. Below is a summary of said model. In the plot tabs, you can see the plot of the coefficients for this model.
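For reference, ridge regression (what `glmnet` fits with `alpha = 0`) minimizes the least-squares criterion plus an $\ell_2$ penalty on the coefficients:

$$\hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta}\left\{\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2 + \lambda\sum_{j=1}^{p}\beta_j^2\right\}$$

Larger $\lambda$ shrinks the coefficients more; $\lambda = 0$ recovers ordinary least squares.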

```{r}
mlr_cols <- c(1, 4:8, 10, 18)
df_mlr <- df_w %>%
    dplyr::select(all_of(mlr_cols))
```

```{r}
y <- df_mlr$Total.Sale.Value
selected_columns <- c('Calculated.Acreage', 'Total.Structures', 'Land.Sale.Value')

# Select the columns from the data frame
selected_data <- df_mlr[, selected_columns]

# Convert the selected data to a matrix
x <- as.matrix(selected_data)

ridge_model1 <- glmnet(x, y, alpha = 0)
kable(summary(ridge_model1))
```

### Finding the Best $\lambda$

To find our best $\lambda$ value we cross-validate our models. The `glmnet` library has a built-in function for this, and the process is visualized under "Plotting Cross-Validated Model". To see whether this improved the previous model, we calculate $R^2$ to see how much variation the model accounts for. The coefficient plot with our best $\lambda$ is in the "Plotting Best Model" tab.

```{r}
cv_model1 <- cv.glmnet(x, y, alpha = 0)
best_lambda1 <- cv_model1$lambda.min
best_ridge_model <- glmnet(x, y, alpha = 0, lambda = best_lambda1)


#use fitted best model to make predictions
y_predicted <- predict(ridge_model1, s = best_lambda1, newx = x)

#find SST and SSE
sst <- sum((y - mean(y))^2)
sse <- sum((y_predicted - y)^2)

#find R-Squared
rsq <- 1 - sse/sst

values <- c(best_lambda1, rsq)
```

Below is the $\lambda$ value that best fits our data.

```{r}
best_lambda1
```

Below are the coefficients for the model with said $\lambda$.

```{r}
coef(best_ridge_model)
```

Below is the summary of said model.

```{r}
kable(summary(best_ridge_model))
```

The whole point of using this method was to see if we could find an improved model that addresses multicollinearity. After calculating $R^2$, this model appears to be a slight improvement over the reduced model from the previous section, using the same variables and response.

```{r}
rsq
```

Column {.tabset}
-----------------------------------------------

### First Model Coefficients Plot

```{r}
plot(ridge_model1)
```

### Plotting Cross-Validated Model

```{r}
plot(cv_model1)
```

### Plotting Best Model

```{r}
plot(ridge_model1, xvar = "lambda")
```

Logistic Regression
=======================================================================

```{r}
df_log2 <- df_w %>%
    filter(Land.Class.Code %in% c("R") & Type.And.Use.Description %in% "SINGLFAM") %>%
    filter(Total.Sale.Value < 832500) %>%
    mutate(Is.Expensive = ifelse(Total.Sale.Value > 489000, 1, 0)) %>%
    dplyr::select(Calculated.Acreage, Total.Structures, Total.Units, Building.Value,
           Land.Value, Total.Building.Square.Footage, Is.Expensive)
```

Column {.sidebar}
-----------------------------------------------------------------------

### Summary

What makes a piece of property expensive? I used the median property value in Wake County, NC as my cutoff point between a property being expensive or not and fit a logistic regression model using all possible subsets. Before picking the best model, I eliminated all the outliers in the dataset.

After accounting for multicollinearity, my final model uses the total number of structures and the total building square footage as predictors. After predicting the response and link values and fitting the logit curve onto the graph, I estimated roughly a 12% prediction error by cross-validating the out-of-sample data with the `cv.glm` function.

Column {data-width=500, .tabset}
-----------------------------------------------------------------------

### All Possible Subsets

What makes a piece of property expensive? I wanted to answer this question using logistic regression as my classification model. To do so, I created a new Boolean variable titled `Is.Expensive`, which is true if the total sale value of the property was over \$489,000, the 75th percentile in our dataset. Before picking the best model, I filtered out all property where the total sale value was over \$832,500, since anything over that threshold is considered an outlier. Using this new variable, I ran all possible subsets to find the best variables for my model. The table of all possible subsets is shown below:

```{r}
best_subsets <- regsubsets(Is.Expensive ~ ., data=df_log2)
best_subsets_results <- summary(best_subsets)

table <- data.frame(best_subsets_results$outmat, best_subsets_results$adjr2,
                    best_subsets_results$cp, best_subsets_results$bic)
colnames(table) <- c("Acreage", "Structures", "Units", "Building", 
                    "Land", "Sq. Footage", "AdjR2", "Cp", "BIC")
rownames(table) <- 1:6
knitr::kable(table)
```

As shown above, while other criteria point to multiple models having the best fit, Mallows' $C_p$ points to the model with four regressors. I proceeded with the model classifying expensiveness from calculated acreage, total structures, building value and square footage.

The following shows the summary and VIF of the chosen model.

```{r}
full_log_model <- glm(Is.Expensive ~ Calculated.Acreage + Total.Structures + Building.Value + 
                          Total.Building.Square.Footage, data=df_log2, family="binomial")
summary(full_log_model)
vif(full_log_model)
```

Looking at the scatterplot matrix and the VIF of this model, there is strong evidence that the building value and square footage are collinear. With that in mind, I refit a smaller model without the building value of the property.

```{r}
reduced_log_model <- glm(Is.Expensive ~ Calculated.Acreage + Total.Structures + 
                          Total.Building.Square.Footage, data=df_log2, family="binomial")
summary(reduced_log_model)
vif(reduced_log_model)
```

Now the VIFs of the model are close to 1, but the calculated acreage is no longer significant. With this, we can show our final model on the next page.

### Final Model

The final model (and the model the logit graph is based on) is shown below.

```{r}
reduced_log_model <- glm(Is.Expensive ~ Total.Structures + Total.Building.Square.Footage, 
                         data=df_log2, family="binomial")
summary(reduced_log_model)
vif(reduced_log_model)
```
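The curve drawn in the logit graph is the inverse-logit of the linear predictor $\hat\eta$ from this model:

$$\hat{\pi} = \frac{e^{\hat\eta}}{1 + e^{\hat\eta}}, \qquad \hat\eta = \hat{\beta}_0 + \hat{\beta}_1\,\hbox{Total.Structures} + \hat{\beta}_2\,\hbox{Total.Building.Square.Footage}$$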

Only the number of structures and the total building square footage remain as predictors. With this we can plot our logit curve, and after running 10-fold cross-validation, the estimated prediction error is only about 12%, shown below.

```{r}
cv.glm(data = df_log2, glmfit = reduced_log_model, K = 10)$delta[1]
```


Column {data-width=500, .tabset}
-----------------------------------------------------------------------

### Scatterplot Matrix

```{r}
temp <- df_log2
colnames(temp) <- c("CA", "TS", "TU", "BV", "LV", "TBSF", "IE")
ggpairs(temp[1:10000, ]) +
    theme(axis.line=element_blank(), axis.text=element_blank(), axis.ticks=element_blank())
```

### Logit Graph

```{r}
pihat <- predict(reduced_log_model, type="response")
etahat <- predict(reduced_log_model, type="link")
ggplot(df_log2, aes(x=etahat, y=Is.Expensive)) + theme_bw() +
    geom_point(aes(color=Is.Expensive), size=0.5, position=position_jitter(height=0.005, width=0)) +
    geom_line(aes(x=etahat, y=pihat)) +
    labs(x="Link Prediction", y="P(Expensive)", title="Probability a Piece of Property is Expensive")
```

Natural Cubic Splines
=======================================================================

summary {.sidebar}
-----------------------------------------------------------------------

### How Land Value affects Total Sale Value

Does the land value affect the total sale value of a property? I used natural cubic splines to investigate the relationship between `Total.Sale.Value` and `Land.Value` and found that the spline with five degrees of freedom fits the data best.

Column {.tabset}
-----------------------------------------------------------------

### Different Degrees

First, I filtered the data to land values under \$250,000 and total sale values under \$900,000 so that outliers aren't included in our regression model. I also sampled the data down to 3,000 observations for legibility. Then, I fit models with one to five degrees of freedom in order to find our best model. Below is a summary of the five models.

```{r}
#Creating dataframe for plot

ns_filtered <- df_w %>%
  filter(Land.Value < 250000, Total.Sale.Value < 900000) %>%
  dplyr::select(Land.Value, Total.Sale.Value)

ns_df <- ns_filtered %>% sample_n(3000)

#Creating NS with different dfs
ns.1 <- lm(Total.Sale.Value ~ ns(Land.Value, df= 1), data = ns_df)
ns.2 <- lm(Total.Sale.Value ~ ns(Land.Value, df= 2), data = ns_df)
ns.3 <- lm(Total.Sale.Value ~ ns(Land.Value, df= 3), data = ns_df)
ns.4 <- lm(Total.Sale.Value ~ ns(Land.Value, df= 4), data = ns_df)
ns.5 <- lm(Total.Sale.Value ~ ns(Land.Value, df= 5), data = ns_df)

summary(ns.1)
summary(ns.2)
summary(ns.3)
summary(ns.4)
summary(ns.5)
```

To determine the best degrees of freedom, I calculated each spline's sum of squared errors (SSE) by summing the squared residuals.
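That is, for each spline with fitted values $\hat{y}_i$:

$$\mathrm{SSE} = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$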

```{r}
#Determining which df is best
SSE.1 <- sum(residuals(ns.1)^2)
SSE.2 <- sum(residuals(ns.2)^2)
SSE.3 <- sum(residuals(ns.3)^2)
SSE.4 <- sum(residuals(ns.4)^2)
SSE.5 <- sum(residuals(ns.5)^2)
SSE.1 <- c(SSE.1, SSE.2, SSE.3, SSE.4, SSE.5)

sse_data <- data.frame("df" = c(1,2,3,4,5), "SSE" = SSE.1)
kable(sse_data, caption = "df and SSE for NS models")

```

### Best Model

From the SSE table and the elbow plot, the best model appears to be the one with five degrees of freedom. Below is a summary of the model.

```{r}
summary(ns.5)
```

Column {.tabset}
------------------------------------------

### Plots for Models with Different Degrees

```{r}
ggplot(data = ns_df, aes(x = Land.Value, y = Total.Sale.Value)) +
  geom_point(size = 0.7) +
  geom_smooth(method = "lm", formula = y ~ ns(x, df = 1), se = FALSE, aes(color = "1st")) +
  geom_smooth(method = "lm", formula = y ~ ns(x, df = 2), se = FALSE, aes(color = "2nd")) +
  geom_smooth(method = "lm", formula = y ~ ns(x, df = 3), se = FALSE, aes(color = "3rd")) +
  geom_smooth(method = "lm", formula = y ~ ns(x, df = 4), se = FALSE, aes(color = "4th")) +
  geom_smooth(method = "lm", formula = y ~ ns(x, df = 5), se = FALSE, aes(color = "5th")) +
  scale_color_manual(
    name = "Degrees of Freedom",
    breaks = c("1st", "2nd", "3rd", "4th", "5th"),
    labels = c("1st", "2nd", "3rd", "4th", "5th"),
    values = c("yellow3", "hotpink3", "brown4", "springgreen4", "orange3")
  ) +
  theme(legend.position = "bottom") +
  theme_bw() +
  labs(title = "Natural Cubic Splines", x = "Land Value ($)",
       y = "Total Sale Value ($)")
```

### Elbow Plot

```{r}
ggplot(data = data.frame(df =c(1,2,3,4,5), SSE.1),       
          aes(x=df, y = SSE.1)) + geom_point() + 
          labs(x="Degrees of Freedom", y="SSE", title="Natural Spline Elbow Plot") + geom_line() + theme_bw()
```

### Best Model Plotted

```{r}
ggplot(data = ns_df, aes(x = Land.Value, y = Total.Sale.Value)) +
  geom_point(size = 0.9) +
  geom_smooth(method = "lm", formula = y ~ ns(x, df = 5), se = TRUE, 
              color = "royalblue", size=2) +
  geom_vline(xintercept = attributes(ns(ns_df$Land.Value, df = 5))$knots, 
             linetype = "dashed", color = "orange", size = 1) +
  ggtitle("Natural Cubic Spline on Land Value vs Total Sale Value") +
  labs(x = "Land Value ($)", y = "Total Sale Value ($)") +
  theme_bw()
```

Naive Bayes
=======================================================================

```{r}
nb_cols <- c(1, 2, 4:8, 10, 15) 
df_nb <- df_w %>%
    dplyr::select(all_of(nb_cols)) %>%
    filter(APA.Activity.Description %in% "Household Activities" & 
               Land.Class %in% "Residential < 10 Acres" & Total.Structures == 1) %>%
    mutate(Calculated.Acerage.Class = case_when(Calculated.Acreage > 0.4 ~ "Large", 
                                                Calculated.Acreage <= 0.4 ~ "Small")) %>%
    mutate(Land.Value.Class = case_when(Land.Value < 50000 ~ "<50k",
                                  Land.Value >= 50000 & Land.Value < 100000 ~ "50k-100k",
                                  Land.Value >= 100000 ~ ">100k")) %>%
    mutate(Total.Sale.Class = case_when(Total.Sale.Value < 200000 ~ "<200k",
                                        Total.Sale.Value >= 200000 & Total.Sale.Value < 400000 ~ "200k-400k",
                                        Total.Sale.Value >= 400000 ~ ">400k")) %>%
    mutate(Estimated.Rent = round(abs(Land.Value - Total.Sale.Value) / 36), .before=10) %>%
    mutate(Estimated.Rent.Class = case_when(Estimated.Rent < 5000 ~ "<5k",
                                         Estimated.Rent >= 5000 & Estimated.Rent < 10000 ~ "5k-10k",
                                         Estimated.Rent >= 10000 ~ ">10k"))

n <- createDataPartition(y=df_nb$Estimated.Rent.Class, p=0.7, list=FALSE)
nb_train <- df_nb[n, ]
nb_test <- df_nb[-n, ]

model <- naiveBayes(Estimated.Rent.Class ~ Calculated.Acerage.Class + Land.Value.Class + Total.Sale.Class,
                    data=nb_train, laplace=0.5)
yhat <- predict(model, newdata=nb_test)
confusion_matrix <- confusionMatrix(factor(yhat), factor(nb_test$Estimated.Rent.Class), 
                                    dnn=c("Predicted", "Actual"))
```

Summary {.sidebar}
-----------------------------------------------------------------------

### Summary

I built a Naive Bayes classifier to classify the amount of money a family will pay per month in rent based on acreage, land value and total sale value. Since Naive Bayes is a categorical classifier by nature, I created intervals for each numerical variable and classified my data using this model.
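Under the usual conditional-independence ("naive") assumption, the classifier assigns each property the rent class $c$ that maximizes the posterior:

$$\hat{c} = \arg\max_{c}\; P(c)\prod_{j} P(x_j \mid c)$$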

My most accurate model classified using the calculated acreage, land value, and total sale value of the property, with an accuracy rate of about 77%. 

Column
-----------------------------------------------------------------------

### Accuracy rate

```{r}
valueBox("77%", icon="ion-android-checkmark-circle")
```

### Observations tested

```{r}
valueBox(20261, icon="ion-wrench")
```

### Training/testing split

```{r}
valueBox("70-30%", icon="ion-android-clipboard")
```

### Confusion Matrix

```{r}
temp <- as.data.frame(confusion_matrix$table)
colnames(temp)[3] <- "Frequency"
temp$Predicted <- factor(temp$Predicted, levels=rev(levels(temp$Predicted)))

ggplot(temp, aes(Predicted, Actual, fill=Frequency)) + theme_bw() +
    geom_tile() + geom_text(aes(label=Frequency)) +
    scale_fill_gradient(low="white", high="#43c3f4") +
    labs(title="Predicted vs. Actual Estimated Rent per Month ($)")
```

Column
-----------------------------------------------------------------------

### Research Question & Classification Process

Let's pretend that you're a property developer building single-family houses and you bought a piece of land and want to estimate how much money your residents are going to pay per month in rent. Over the span of three years, about how much will they need to pay per month for your net income to break even?

I wanted to view the relationship between the land value, acreage and total sale value of a plot of residential land. Since Naive Bayes is much better suited to categorical data than numeric data, I created intervals for these variables. I calculated the estimated rent (per month, over 36 months) as follows:

$$\hbox{Rent} = \frac{1}{36} \left|\hbox{Land Value} - \hbox{Total Sale Value}\right|$$

Below is a table summarizing these intervals:

|   **Interval**   |                     **Description**                    |
|:----------------:|:------------------------------------------------------:|
| Calculated Acreage |  "Small" if 0.4 acres or less, "Large" if greater  |
|    Land Value    |  Less than $50k, between $50k and $100k or over $100k  |
| Total Sale Value | Less than $200k, between $200k and $400k or over $400k |
|  Estimated Rent  |    Less than $5k, between $5k and $10k or over $10k    |
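For example, with hypothetical values of \$100,000 for the land and \$400,000 for the total sale:

$$\hbox{Rent} = \frac{1}{36}\,\big|100{,}000 - 400{,}000\big| = \frac{300{,}000}{36} \approx \$8{,}333,$$

which falls in the "5k-10k" estimated rent class.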

With this in mind, the output of the confusion matrix is shown below:

```{r}
confusion_matrix
```

kNN Classification
===============================================================

summary {.sidebar}
-----------------------------------------------------------------------

### Summary

Does the property's age make a difference in the total sale value? I want to see if the year the property was built has a significant effect on the total sale value. I assumed it would, since properties and estates gradually become more valuable over time. I also wanted to see whether, based on the sale and land values, I could predict the time period in which the property was built.

Column
---------------------------------------------------------------------------

### The Method

I created a new column, `Time.Period`, which is a categorical variable with three values: `Old` for properties built before 1995, `Modern` for properties built between 1995 and 2019, and `New` for properties built after 2019. For visualization purposes, I reduced the dataset to a sample of 5,000 observations. Then, I split the data into training and testing sets with a 70/30 split, ran the kNN classifier on the testing data ($k=5$), constructed the confusion matrix, and obtained a classification rate of 76%. 

A problem with this method is that the properties are not evenly distributed across the years. Far fewer properties fall in the oldest and newest bins, so there is a heavy bias towards classifying a property as `Modern`, since a large share of the properties were built between 1995 and 2019.
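For context, kNN here takes a majority vote among the $k = 5$ nearest neighbors under Euclidean distance on the two standardized features, which is why the features are scaled before classification:

$$d(u, v) = \sqrt{(u_1 - v_1)^2 + (u_2 - v_2)^2}$$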

```{r}
years_binned <- df_w %>%
  filter(Land.Value < 250000, Total.Sale.Value < 900000) %>%
  dplyr::select(Land.Value, Total.Sale.Value, Year.Built)

years_binned$Time.Period <- ifelse(years_binned$Year.Built < 1995, "Old(Before 1995)",
                            ifelse(years_binned$Year.Built >= 1995 & years_binned$Year.Built <= 2019,
                                   "Modern(1995-2019)","New(2019-)"))

set.seed(42)
knn_df <- years_binned %>% sample_n(5000)

index <- sample(1:nrow(knn_df), round(nrow(knn_df) * 0.7))
training_df <- knn_df[index, ]
testing_df <- knn_df[-index, ]

# Store the training/testing data features
train_features1 <- training_df[, 1:2]
test_features1 <- testing_df[, 1:2]

# Scale the features
train_features <- scale(train_features1)
test_features <- scale(test_features1)

# Store the actual labels (Time.Period as a factor)
train_classes <- factor(training_df$Time.Period)
test_classes <- factor(testing_df$Time.Period)

knn_classes <- knn(train = train_features, test = test_features,
cl = train_classes, k = 5)
```

```{r}
confusion_matrix <- confusionMatrix(data = knn_classes, reference = test_classes)
confusion_matrix
```

### Accuracy rate (%)

```{r}
valueBox(76, icon="ion-android-checkmark-circle")
```

### Observations tested

```{r}
valueBox(5000, icon="ion-wrench")
```

### Training/testing split (%)

```{r}
valueBox("70-30", icon="ion-android-clipboard")
```

Column
--------------------------------------------------

### kNN Visualization Plot 

```{r}

plot_data <- data.frame(Feature1 = test_features1$Land.Value, 
                        Feature2 = test_features1$Total.Sale.Value,
                        PredictedClass = knn_classes)

# Create a scatter plot
ggplot(plot_data, aes(x = Feature1, y = Feature2, color = as.factor(PredictedClass))) +
  geom_point() +
  labs(x = "Land Value ($)", y = "Total Sale Value ($)", 
       color = "Predicted Time Periods", title = "kNN Classification of Time Period") +
  theme_classic()  
```

### Confusion Matrix

```{r}
temp <- as.data.frame(confusion_matrix$table)
colnames(temp)[3] <- "Frequency"
temp$Predicted <- factor(temp$Prediction, levels=rev(levels(temp$Prediction)))

ggplot(temp, aes(Predicted, Reference, fill=Frequency)) + theme_bw() +
    geom_tile() + geom_text(aes(label=Frequency)) +
    scale_fill_gradient(low="white", high="#43c3f4") +
    labs(title="Predicted vs. Reference for Time Period")
```